This data set was downloaded from the Bureau of Transportation Statistics. The file contains approximately 500,000 flights that took place, or were scheduled to take place, in March 2017. The data contain 45 variables, including information about the origin and destination, the departure and arrival delays, and other significant information about the planned flight. A dictionary describing each variable is included with the file.
The 12 airlines that served the US domestic market in March 2017.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -67.000 -5.000 -2.000 9.286 6.000 1773.000 8376
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -92.000 -14.000 -5.000 4.306 8.000 1792.000 9443
Here we can see the departure and arrival delay variables. Negative values indicate that the flight left or arrived early. I was looking for differences in shape, and what emerges from this plot is that flights tend to depart early, but almost all early departures are only a few minutes early: the highest concentration is just before 0, after which the frequency decreases. The arrival delay looks closer to a normal distribution, although the summary shows a long right tail (maximum 1,792 minutes). Early arrivals are more numerous and more widely spread, while for delayed flights the slope after 0 is similar to that of departures.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -91.000 -12.000 -6.000 -4.936 1.000 173.000 9443
As we can see, the difference (arrival delay minus departure delay) has a roughly normal shape around its mean of -4.94 minutes. So, on average, flights make up almost 5 minutes between departure and arrival.
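The difference variable can be sketched as follows. This is a minimal illustration on made-up values, not the report's actual code; `flights_toy` is a hypothetical stand-in for the real table, while `DEP_DELAY`, `ARR_DELAY` and `DIFFERENCE` follow the column names that appear in the outputs.

```r
# Toy data standing in for the real flights table (values are made up).
flights_toy <- data.frame(
  DEP_DELAY = c(-5, 10, 0, NA, 30),
  ARR_DELAY = c(-12, 4, -3, NA, 41)
)

# Minutes gained (negative) or lost (positive) between gate departure and arrival.
flights_toy$DIFFERENCE <- flights_toy$ARR_DELAY - flights_toy$DEP_DELAY

# summary() reports the NA's separately; mean() needs na.rm = TRUE to skip them.
summary(flights_toy$DIFFERENCE)
mean_difference <- mean(flights_toy$DIFFERENCE, na.rm = TRUE)
```

Cancelled flights have no delay values, so their difference stays `NA` and is excluded from the mean.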
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 0.0 0.0 19.1 16.0 1773.0 399959
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 0.0 0.0 2.1 0.0 1312.0 399959
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 0.0 3.0 15.6 19.0 1186.0 399959
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 0.0 0.0 0.1 0.0 286.0 399959
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 0.0 4.0 24.3 30.0 1365.0 399959
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 31.0 391.0 691.0 853.7 1096.0 4983.0
The summary gives us some quite interesting information: the shortest flight is only 31 miles, and 25% of the flights were shorter than 391 miles. The longest flight is almost 5,000 miles; considering that Miami to Seattle is roughly 2,700 miles by air, I think this must be a flight from Hawaii or a Pacific territory. As the plot shows, there are very few flights longer than 3,000 miles.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 8.0 62.0 101.0 118.2 150.0 712.0 9443
Here the shape is clearly similar to that of distance. Again we get an important piece of information: 75% of US domestic flights spend 150 minutes (two and a half hours) or less in the air.
## [1] 0.01745201
As we can see in these plots, not many flights were cancelled: a proportion of 0.0175, so less than 2%. We can also notice from the second plot that the most common reason for cancellation was weather (B), while National Air System (C) was the second and carrier (A) the least common.
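A proportion like the one printed above can be obtained as the mean of a 0/1 cancellation flag. The sketch below uses made-up data; the column names `CANCELLED` and `CANCELLATION_CODE` are assumptions about the file layout, and `flights_toy` is hypothetical.

```r
# Toy data: CANCELLED is a 0/1 flag, CANCELLATION_CODE is empty for flights that operated.
flights_toy <- data.frame(
  CANCELLED = c(0, 0, 1, 0, 1, 0, 0, 1, 0, 0),
  CANCELLATION_CODE = c("", "", "B", "", "C", "", "", "B", "", ""),
  stringsAsFactors = FALSE
)

# The mean of a 0/1 flag is the share of cancelled flights.
cancel_rate <- mean(flights_toy$CANCELLED)

# Count the reasons among cancelled flights only, most common first.
cancelled <- flights_toy[flights_toy$CANCELLED == 1, ]
reason_counts <- sort(table(cancelled$CANCELLATION_CODE), decreasing = TRUE)
```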
These plots show the distribution of flights over the days. The first plot shows the number of flights for each day of March. A weekly pattern is visible starting from March 1st, a Wednesday, with the lowest frequency on Saturdays. The second plot confirms that Saturday is the day of the week with the fewest flights, followed by Sunday and Tuesday. I am quite surprised by this result, as I would have expected a higher frequency on Monday! Part of the explanation is the calendar: March 2017 contains five Wednesdays, Thursdays and Fridays but only four of every other weekday, which inflates the monthly totals for those three days. March also has some state holidays, one in Texas and one in California, one falling on a Thursday and one on a Friday, which could explain a higher frequency from the states with the most flights. An investigation over all 365 days would probably be more correct and coherent, but the file would contain around 6 million rows.
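A 31-day month does not contain every weekday the same number of times, so monthly totals per weekday can be divided by the number of occurrences before they are compared. The sketch below counts the weekdays of March 2017 with base R. The flight totals are made-up numbers used only to show the normalisation.

```r
# All calendar days of March 2017.
march_days <- seq(as.Date("2017-03-01"), as.Date("2017-03-31"), by = "day")

# ISO weekday number (1 = Monday ... 7 = Sunday); avoids locale-dependent day names.
weekday_num <- as.integer(format(march_days, "%u"))
occurrences <- table(factor(weekday_num, levels = 1:7))

# Hypothetical monthly totals per weekday (made-up numbers, Monday first).
flights_per_weekday <- c(64000, 62000, 80000, 81000, 82000, 52000, 61000)

# Average number of flights on a single calendar day of each weekday.
flights_per_day <- flights_per_weekday / as.vector(occurrences)
```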
In this plot I am trying to understand how many times a flight with the same number flies in a month. As the plot shows, the data seem quite unrealistic: some flight numbers occur almost 300 times in a month. So I guess that several airlines use the same numbers. For this reason I am going to create a new variable that merges the flight number with the unique airline code.
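Merging the two fields can be sketched as below, on made-up rows. `UNIQUE_CARRIER` appears in the outputs of this report; `FL_NUM` is an assumed name for the flight number column, and `FLIGHT_ID` and `flights_toy` are hypothetical names chosen for the illustration.

```r
flights_toy <- data.frame(
  UNIQUE_CARRIER = c("AA", "AA", "DL", "DL", "DL", "WN"),
  FL_NUM         = c(100, 100, 100, 100, 200, 100),
  stringsAsFactors = FALSE
)

# Counting by number alone lumps together flight 100 of different airlines.
by_number <- table(flights_toy$FL_NUM)

# Carrier code plus flight number gives an identifier that belongs to one airline.
flights_toy$FLIGHT_ID <- paste0(flights_toy$UNIQUE_CARRIER, flights_toy$FL_NUM)
by_flight_id <- table(flights_toy$FLIGHT_ID)
```

In the toy data, number 100 appears five times, but no single airline flies it more than twice.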
Now this plot seems much more realistic. We can see a line at 31: most flights occur once a day, every day. Many flights occur less often, but I think this is normal, since some routes are flown fewer than 7 times a week for commercial reasons. We still see some flights that occur 62 times, so I think some airlines use the same code twice a day. Finally, there are some extreme values, especially values over 100.
Here we can see the frequency for each tail number, that is, for each airplane. We cannot distinguish a clear pattern, but a few airplanes have a frequency of over 200 flights, while most are between 100 and 200. There are also airplanes that did not fly much, which could be caused by a replacement, a defect, a refurbishment, etc. There are 4 different checks for an airplane; the longest normally takes place every 6 years and can take up to a month to complete.
Comparing the departure and arrival states, we cannot distinguish any significant difference. Interestingly, I would have expected many more flights from New York State, while the top five are California, Florida, Texas, Georgia and Illinois.
The x axis is not exact, because time is recorded as clock time (hours and minutes, base 60) while the axis treats it as an ordinary base-10 number. Still, we can see that there are few departures between midnight and 5 am. We can also observe 3 peaks: the morning, lunchtime and evening flights. There are about 30,000 departures around 8 am, which is almost 1,000 per day, and the same around 5 pm. For arrivals things are a little different, because everything is clearly shifted a bit later: there are fewer flights between 2 am and 6 am. There is also a higher frequency at the end of the day, giving almost 4 peaks in this case. From this plot we can see that airports are really calm only between 2 and 4 am.
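One way to get an evenly spaced time axis is to convert the clock values to minutes since midnight, or to decimal hours, before plotting. The sketch below assumes the times are stored as HHMM integers (for example 759 for 07:59); the vector `dep_time_hhmm` holds made-up values.

```r
# Clock times stored as HHMM integers, e.g. 759 means 07:59 and 800 means 08:00.
dep_time_hhmm <- c(5, 759, 800, 1330, 2359)

# Split into hours and minutes with integer division and remainder.
hours   <- dep_time_hhmm %/% 100
minutes <- dep_time_hhmm %% 100

# Minutes since midnight and decimal hours are both evenly spaced scales.
minutes_since_midnight <- hours * 60 + minutes
decimal_hour <- minutes_since_midnight / 60
```

On the raw HHMM scale, 759 and 800 are 41 units apart although they differ by one minute. After the conversion they are 479 and 480.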
The data set contains 488,597 observations and 45 variables. Not all of the variables present are of interest.
The main feature of interest is the delay, especially the arrival delay.
The distance, the airport, the time of day and the day of the week could help me explain the delay.
I created a new variable to identify each flight number, combining the unique code of the airline with the flight number. I also created the variable difference, which is the difference between the arrival and the departure delay.
The data set is tidy as it is, but when plotting the new flight number variable, the frequency of flights per day became much clearer.
I used ggpairs to look for correlations between some of the variables in the data set. As we can see, apart from the obvious relations (departure and arrival delay, or distance and air time), there are no important correlations.
Here is the plot of departure and arrival delay; as we can see, they are, unsurprisingly, strongly correlated. What is interesting is the small line that we can see around 0 minutes of departure delay: some flights that left without delay still arrived late, probably because of adverse wind or traffic at the destination airport. We can also see more flights whose delay grew by the time they landed than flights that arrived with less delay than they left with, so flights do not easily recover their delay in the air.
##
## Pearson's product-moment correlation
##
## data: flights_sample$DEP_DELAY and flights_sample$TAXI_OUT
## t = 14.984, df = 49112, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.05865152 0.07625898
## sample estimates:
## cor
## 0.0674605
In this plot we have the taxi-out time and the departure delay. We can see that most of the flights that departed without delay actually spent more time taxiing, while flights that were late normally spent less. The overall correlation is very weak (about 0.07).
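The correlation outputs in this report come from Pearson's test. The sketch below runs `cor.test` on simulated data and pulls out the estimate and its confidence interval. The variables `dep_delay` and `taxi_out` are made-up vectors, not the real columns. With tens of thousands of observations even a very weak correlation gives a tiny p-value, so the estimate is the number to look at.

```r
set.seed(42)  # reproducible toy data

# Made-up delays and taxi-out times with a weak positive link.
n <- 500
dep_delay <- rnorm(n, mean = 10, sd = 30)
taxi_out  <- 15 + 0.02 * dep_delay + rnorm(n, sd = 8)

# Pearson's product-moment correlation with its 95% confidence interval.
result <- cor.test(dep_delay, taxi_out, method = "pearson")

r_estimate <- unname(result$estimate)  # the "cor" value in the printed output
r_interval <- result$conf.int          # lower and upper bounds
```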
##
## Pearson's product-moment correlation
##
## data: flights_sample$ARR_DELAY and flights_sample$TAXI_OUT
## t = 44.834, df = 49001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1899869 0.2069972
## sample estimates:
## cor
## 0.198507
Here is the same plot but with arrival delay and taxi-out time. The shape is similar to the previous one, but not the same: there are fewer flights that spent a lot of time taxiing and still had 0 minutes of delay. These two variables are weakly correlated, with a coefficient of about 0.20.
Here is the box plot of the arrival delay for each carrier. Contrary to what I would have expected, most of the operators have a median below 0.
Here is another box plot, this time per state. As we can see, no state has a median over 0.
##
## Pearson's product-moment correlation
##
## data: flights_sample$ARR_DELAY and flights_sample$DISTANCE
## t = -3.4543, df = 49001, p-value = 0.0005522
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.024453374 -0.006749718
## sample estimates:
## cor
## -0.01560277
The box plot of arrival delay and distance. As seen before, there is no meaningful correlation between them, and it seems that there are delays at every distance. There are fewer delays at long distances, but there are also fewer flights of this kind.
The same holds for departure delay.
##
## Pearson's product-moment correlation
##
## data: flights_sample$DIFFERENCE and flights_sample$DISTANCE
## t = -22.169, df = 49001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.10840772 -0.09087558
## sample estimates:
## cor
## -0.09964938
I expected a stronger relation here: I thought that long flights could recover lost time in the air, but the relation is actually weak (about -0.10).
##
## Pearson's product-moment correlation
##
## data: flights_sample$DIFFERENCE and flights_sample$TAXI_OUT
## t = 121.74, df = 49001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4750642 0.4886602
## sample estimates:
## cor
## 0.4818912
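This is a striking relation: the data tell us that taxi-out time and the difference are positively correlated (about 0.48). A longer taxi-out goes with more time lost between leaving the gate and arriving.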
Using only the data for delayed flights, the plots are less clear.
The box plot by carrier gives us some funny information. Considering only the flights that had a delay, Hawaiian Airlines appears to be the best, with a median delay of approximately 8 minutes. But as seen before, when all of its flights are considered, Hawaiian Airlines was among the worst by median arrival delay in March 2017!
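A ranking can flip like this when only delayed flights are kept, because the filter ignores how many flights each carrier got in early. The sketch below reproduces the effect on made-up numbers; "XX" is a hypothetical carrier code and the delays are invented so that the arithmetic is easy to follow.

```r
flights_toy <- data.frame(
  UNIQUE_CARRIER = rep(c("HA", "XX"), each = 6),
  ARR_DELAY = c(-2, -1, 0, 6, 8, 10,         # HA: often late, but only slightly
                -20, -15, -10, -5, 40, 60),  # XX: rarely late, but badly so
  stringsAsFactors = FALSE
)

# Median over all flights of each carrier.
median_all <- tapply(flights_toy$ARR_DELAY, flights_toy$UNIQUE_CARRIER, median)

# Median over delayed flights only (ARR_DELAY > 0).
late <- flights_toy[flights_toy$ARR_DELAY > 0, ]
median_late <- tapply(late$ARR_DELAY, late$UNIQUE_CARRIER, median)
```

Over all flights the toy HA has the higher median (3 against -7.5), while among delayed flights it has the lower one (8 against 50).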
These two plots give the delay for all flights and for the delayed flights only, for each day of the week. We do not see an important difference between the days, probably just a lower variance on Saturday.
These two plots show distance and air time. As we can see, they say pretty much the same thing. The shape for Hawaiian Airlines stands out from the others. We can assume that Hawaiian made a lot of short trips between the islands, and then long flights to the mainland.
Among the states, North Dakota and Kansas seem to have the best delays when considering only delayed flights.
The same plot by city is impossible to read.
Looking at the box plots of the 5 different kinds of delay by carrier, we can notice that not many flights were affected by weather, and only by very low values, so I guess there was no huge storm or hurricane in March. Hawaiian Airlines (HA) is the only one that was almost unaffected by NAS delays, while it has, together with Southwest Airlines (WN), the highest rate of delays caused by the carrier and by late aircraft.
Plotting the frequency of each carrier by day of the week, we can notice a similar shape, with Saturday as the least frequent day. The only exception is Spirit Airlines (NK).
Finally I will group my data by origin city and also by destination city, adding the geographic coordinates of each.
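The grouping step can be sketched as below, with one summary row per city. The tibble output suggests the report used dplyr, but that is an assumption; this sketch uses base R only so that it runs without extra packages. The data are made up, `TAXI_IN` is an assumed column name, and the summary names mirror those in the printed tables.

```r
flights_toy <- data.frame(
  DEST_CITY_NAME = c("Akron, OH", "Akron, OH", "Akron, OH", "Albany, NY", "Albany, NY"),
  ARR_DELAY      = c(-6, 40, NA, -2, 8),
  TAXI_IN        = c(5, 7, NA, 6, 8),
  stringsAsFactors = FALSE
)

# Split the rows by city, then compute one summary row per city.
by_city <- split(flights_toy, flights_toy$DEST_CITY_NAME)
flights_by_dest <- do.call(rbind, lapply(by_city, function(d) {
  data.frame(
    DEST_CITY_NAME   = d$DEST_CITY_NAME[1],
    arr_delay_mean   = mean(d$ARR_DELAY, na.rm = TRUE),
    arr_delay_median = median(d$ARR_DELAY, na.rm = TRUE),
    taxi_in_mean     = mean(d$TAXI_IN, na.rm = TRUE),
    n                = nrow(d)  # counts every row, including flights with NA delays
  )
}))
```

The same pattern applies to the origin city by swapping the grouping column and the delay and taxi columns.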
## # A tibble: 6 x 12
## DEST_CITY_NAME arr_delay_mean arr_delay_median arr_delay_new_mean
## <fctr> <dbl> <dbl> <dbl>
## 1 Aberdeen, SD -8.639344 -12 3.147541
## 2 Adak Island, AK -26.000000 -27 0.000000
## 3 Aguadilla, PR 21.183824 5 25.308824
## 4 Akron, OH 9.330404 -6 16.595782
## 5 Albany, GA 3.448276 -8 11.793103
## 6 Albany, NY 3.478643 -2 9.105528
## # ... with 8 more variables: arr_delay_new_median <dbl>,
## # taxi_in_mean <dbl>, taxi_in_median <dbl>, taxi_out_mean <dbl>,
## # taxi_out_median <dbl>, n <int>, lat <dbl>, lon <dbl>
## # A tibble: 6 x 12
## ORIGIN_CITY_NAME dep_delay_mean dep_delay_median dep_delay_new_mean
## <fctr> <dbl> <dbl> <dbl>
## 1 Aberdeen, SD 0.7540984 -2 4.491803
## 2 Adak Island, AK -30.6666667 -29 0.000000
## 3 Aguadilla, PR 21.6642336 8 23.956204
## 4 Akron, OH 9.7175844 -4 14.058615
## 5 Albany, GA 1.9540230 -5 6.517241
## 6 Albany, NY 4.6834805 -2 7.779319
## # ... with 8 more variables: dep_delay_new_median <dbl>,
## # taxi_in_mean <dbl>, taxi_in_median <dbl>, taxi_out_mean <dbl>,
## # taxi_out_median <dbl>, n <int>, lat <dbl>, lon <dbl>
I will use another map to get a better understanding.
With this plot the destination cities are much clearer, especially destinations like the Virgin Islands, Puerto Rico or American Samoa. Alaska is also much clearer.
Last, here is the plot with only the locations on the mainland, so without Alaska and Hawaii. We can distinguish a higher frequency and concentration along the coasts. In the center we can distinguish 2 lines, or rather two curves, running through states that have few destinations and low frequencies despite their large surface. Some of these states have only 2 or 3 destinations.
Here the two plots represent the mean arrival delay and the mean taxi-in time by city.
We can repeat the same process for the origin city.
I used ggpairs on the 2 new data sets grouped by city.
##
## Pearson's product-moment correlation
##
## data: flights_by_dest_info$taxi_in_mean and flights_by_dest_info$n
## t = 15.489, df = 289, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6053814 0.7317956
## sample estimates:
## cor
## 0.6734831
This plot shows the relation between the number of flights for each airport and the mean taxi-in time. A higher frequency corresponds to a higher mean.
##
## Pearson's product-moment correlation
##
## data: flights_by_origin_info$dep_delay_mean and flights_by_origin_info$taxi_out_mean
## t = 6.6072, df = 289, p-value = 1.881e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2580255 0.4581568
## sample estimates:
## cor
## 0.3622592
Here is the plot of the relation between mean taxi-out time and mean departure delay. As we can see, they are moderately correlated (about 0.36), but not in the way one might think. Departure delay is defined as the aircraft leaving the gate later than scheduled, so taxi-out time is not the cause of the delay. Instead, it may mean that a delayed aircraft loses its position and has to wait for another time slot to leave, so as not to affect other aircraft that are on time.
Finally we can see the distribution of mean taxi-out time, which has a shape very similar to the normal distribution.
##
## Pearson's product-moment correlation
##
## data: flights_sample$ARR_DELAY and flights_sample$DEP_TIME
## t = 28.222, df = 49001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1177481 0.1351729
## sample estimates:
## cor
## 0.1264703
This plot shows the relation between the arrival delay and the actual departure time. They are slightly correlated (about 0.13): we can barely see that the arrival delay seems to increase a little over the course of the day.
##
## Pearson's product-moment correlation
##
## data: flights_sample$DEP_TIME and flights_sample$CRS_DEP_TIME
## t = 800.84, df = 49123, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9631369 0.9643952
## sample estimates:
## cor
## 0.9637714
Finally, this plot shows the relation between the scheduled departure time and the actual departure time.
There is more or less no difference among carriers in arrival delay. Among states, some do better and have fewer delayed flights. For cities too, we could not see a huge difference in delay. We could also notice a small correlation between arrival delay and departure time.
I also found a relation between mean taxi-out time and mean departure delay per origin city, and between mean taxi-in time and the number of flights per destination city.
The correlation between mean taxi-in time and the number of flights per destination city is the strongest that I found, almost 0.7. For the taxi-out relation, as said before, this could mean that an aircraft that leaves the gate late has to wait for a new departure slot, and so wastes more time than if it had been on time.
##
## Pearson's product-moment correlation
##
## data: flights_by_origin_info$dep_delay_new_mean and flights_by_origin_info$taxi_out_mean
## t = 5.9805, df = 289, p-value = 6.555e-09
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2254807 0.4304158
## sample estimates:
## cor
## 0.3318581
Here is a plot with the map of the United States, showing the relation between mean taxi-out time and mean departure delay for each origin. As we can see, the blue is more marked in small origins.
##
## Pearson's product-moment correlation
##
## data: flights_by_dest_info$n and flights_by_dest_info$taxi_in_mean
## t = 15.489, df = 289, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6053814 0.7317956
## sample estimates:
## cor
## 0.6734831
Here is the same plot but with destinations, showing the relation between the number of flights and the mean taxi-in time. As we can see, taxi-in takes longer at bigger airports.
##
## Pearson's product-moment correlation
##
## data: flights_by_dest_info$arr_delay_mean and flights_by_dest_info$taxi_out_mean
## t = 6.6137, df = 289, p-value = 1.811e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2583574 0.4584377
## sample estimates:
## cor
## 0.3625681
Plot by destination, showing the relation between the mean arrival delay and the mean taxi-out time (at the origin airport).
In this plot I can see all the routes among the states; we can see that no state reaches every destination. Illinois is the only state that reaches all the other states, but not the overseas territories. Some routes have a higher mean arrival delay, meaning that they are commonly late.
##
## Pearson's product-moment correlation
##
## data: flights_sample$ARR_DELAY and flights_sample$DEP_TIME
## t = 28.222, df = 49001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1177481 0.1351729
## sample estimates:
## cor
## 0.1264703
Here is the plot of the relation between departure time and arrival delay for each carrier.
After summarising the data by tail number, this plot shows the relation between the number of flights of each plane, the mean arrival delay and the mean distance. As we can see, short-distance airplanes flew more frequently over the month and also have a slightly lower mean arrival delay, while long-distance airplanes flew less and their delay seems to be a little higher.
Here is the plot of arrival delay and departure time, by day of the week and by distance. We cannot distinguish a real pattern among the days.
##
## Pearson's product-moment correlation
##
## data: flights_sample$DIFFERENCE and flights_sample$TAXI_OUT
## t = 121.74, df = 49001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4750642 0.4886602
## sample estimates:
## cor
## 0.4818912
And finally I tried to plot the variable difference (arrival minus departure delay) against taxi-out time, also comparing air time. There is nothing interesting concerning air time.
I was looking for variables that could explain, at least in part, the arrival delay, but I did not find any particular cause among the flights, distances or airports. I did find other relations; for example, the strongest one is between the number of flights per destination and the mean taxi-in time. This could mean a bigger airport, and so more time to reach the gate, but also more traffic, which could slow operations. On the other hand, the number of flights and the taxi-out time at the origin airport are only weakly related.
I found it interesting that mean taxi-out time and mean departure delay are related, in the opposite sense to what I would have thought.
## flights$UNIQUE_CARRIER: AA
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -76.000 -14.000 -5.000 3.416 8.000 1489.000 1359
## --------------------------------------------------------
## flights$UNIQUE_CARRIER: AS
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -70.00 -12.00 -2.00 3.37 10.00 462.00 147
## --------------------------------------------------------
## flights$UNIQUE_CARRIER: B6
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -92.000 -16.000 -5.000 9.897 16.000 839.000 1049
## --------------------------------------------------------
## flights$UNIQUE_CARRIER: DL
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -74.000 -18.000 -10.000 -1.886 0.000 1188.000 576
## --------------------------------------------------------
## flights$UNIQUE_CARRIER: EV
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -65.000 -15.000 -6.000 9.684 9.000 1551.000 1365
## --------------------------------------------------------
## flights$UNIQUE_CARRIER: F9
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -55.000 -17.000 -7.000 3.818 8.000 590.000 95
## --------------------------------------------------------
## flights$UNIQUE_CARRIER: HA
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -61.000 -6.000 -1.000 3.738 7.000 1133.000 30
## --------------------------------------------------------
## flights$UNIQUE_CARRIER: NK
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -65.000 -14.000 -5.000 9.451 11.000 1365.000 340
## --------------------------------------------------------
## flights$UNIQUE_CARRIER: OO
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -60.000 -14.000 -5.000 8.018 8.000 1792.000 1300
## --------------------------------------------------------
## flights$UNIQUE_CARRIER: UA
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -64.000 -17.000 -8.000 2.873 6.000 1491.000 731
## --------------------------------------------------------
## flights$UNIQUE_CARRIER: VX
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -62.00 -8.00 3.00 15.42 24.00 399.00 156
## --------------------------------------------------------
## flights$UNIQUE_CARRIER: WN
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -57.000 -10.000 -2.000 4.173 9.000 528.000 2295
I chose this plot because I would never have thought, from my experience, that carriers could maintain such a good standard. As we can see, they all keep their delays very low: the ‘worst’ median is 3 minutes and the ‘worst’ mean is 15.42 minutes, both for Virgin America (VX). There are some extreme low values, from planes that arrived long before the scheduled time. There are also extreme high values: the longest delay was almost 30 hours, for SkyWest Airlines (OO). Delta kept 75% of its flights on time.
In this scatter plot, we can see the distribution of all the airplanes that flew in March 2017. As we can see, the shorter the mean distance of its flights, the more times an airplane flew. We can notice that for airplanes that flew between 100 and 300 times, and therefore flew shorter flights, the arrival delay is concentrated between -50 and 100. The dispersion is much greater for airplanes that flew fewer than 100 times and covered longer distances.
##
## Pearson's product-moment correlation
##
## data: flights_by_origin_info$dep_delay_new_mean and flights_by_origin_info$taxi_out_mean
## t = 5.9805, df = 289, p-value = 6.555e-09
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2254807 0.4304158
## sample estimates:
## cor
## 0.3318581
In this plot we can see all the origin cities across the US. A higher mean departure delay for a city often corresponds to a higher mean taxi-out time. However, the delay is not caused by a longer taxi-out, because delay is measured when the airplane leaves the gate. So what this plot really says is that when an airplane leaves the gate late, it is likely to wait longer on the tarmac before taking off. The reason behind this is that every flight has a takeoff time slot; if it misses it, it has to wait for a new available slot without delaying other airplanes. So when a flight gets delayed, it becomes more difficult to recover from that delay.
The data were imported as downloaded, so no corrections were made. During the analysis I created 2 new variables: difference, which is the difference between arrival and departure delay, and flight number, which merges the unique airline ID with the flight number (4 digits). Over the course of the study I regrouped the data by origin city, by destination city and finally by tail number, to analyse each airplane. At the beginning of the analysis I looked over most of the variables, looking for particular shapes and patterns. Later I focused my attention on the main feature of interest, the arrival delay. Throughout, I tried to find other variables that could help me, at least in part, to better understand the arrival delay. This search was carried out by plotting different variables together or by analysing Pearson's correlation. Finally I plotted several variables together, again looking for particular shapes and patterns.
I think that the biggest issue with this data set was its size. I had to start working with a small sample, because 500k flights is a lot. I was happy anyway to give meaning to all these data, for example in the last plot shown.
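Drawing a reproducible random sample can be sketched as below. The sample size of 50,000 is an assumption, chosen to be roughly consistent with the degrees of freedom in the correlation outputs. The seed, `flights_toy` and `flights_sample_toy` are hypothetical names and values, not those used in the analysis.

```r
set.seed(2017)  # fix the seed so the same rows are drawn on every run

# Stand-in for the full table: only the row count matters here.
flights_toy <- data.frame(id = seq_len(488597))

# Draw the sample without replacement (the default for sample()).
sample_size <- 50000
sample_rows <- sample(nrow(flights_toy), size = sample_size)
flights_sample_toy <- flights_toy[sample_rows, , drop = FALSE]
```

Fixing the seed keeps the plots and correlation values stable between knits of the report.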
I would also love to have the model of each aircraft, and more commercial information such as the number of passengers on board. For further investigation, I would love to combine the same data with another data set on weather or wind conditions at a higher level.